[XPU] add build_sampling_params op.#7738

Open
Jiajun-Ji wants to merge 7 commits into
PaddlePaddle:develop from
Jiajun-Ji:xpu-build_sampling_params

Conversation

Contributor

@Jiajun-Ji Jiajun-Ji commented May 7, 2026

Motivation

💡 If this PR is a Cherry Pick, the PR title needs to follow the format by adding the [Cherry-Pick] label at the very beginning and appending the original PR ID at the end. For example, [Cherry-Pick][CI] Add check trigger and logic(#5191)


Replace the Python implementation of padding_sampling_params on XPU with an XPU kernel implementation, build_sampling_params. In addition, move the infer_seed update into build_sampling_params, and align the infer_seed increment_value step with the GPU implementation.
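For reference, the GPU-aligned stepping rule described above can be sketched in plain Python (the constant and function names below are illustrative placeholders, not the repository's actual API):

```python
# Illustrative sketch of the GPU-aligned infer_seed stepping.
MAX_INFER_SEED = 2 ** 62  # placeholder modulus, not the repo's constant


def increment_value(speculative_decoding: bool, num_speculative_tokens: int = 0) -> int:
    # Non-speculative decoding steps the seed by 4 per iteration;
    # speculative decoding steps by (num_speculative_tokens + 1) * 4.
    if not speculative_decoding:
        return 4
    return (num_speculative_tokens + 1) * 4


def step_infer_seed(seed: int, inc: int) -> int:
    # The seed advances by the increment and wraps at MAX_INFER_SEED.
    return (seed + inc) % MAX_INFER_SEED
```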

Modifications

Usage or Command

Accuracy Tests

[screenshot: accuracy test results]

Verified that INT64 modulo arithmetic inside the XPU kernel works correctly.

Checklist

  • Add at least a tag in the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code, run pre-commit before commit.
  • Add unit tests. Please write the reason in this PR if no unit tests.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

Copilot AI review requested due to automatic review settings May 7, 2026 10:44

paddle-bot Bot commented May 7, 2026

Thanks for your contribution!

Contributor

Copilot AI left a comment


Pull request overview

This PR adds a build_sampling_params custom op for the XPU backend, replacing the previous Python sampling-parameter padding logic with an XPU kernel, and moves the infer_seed update into the op to align with the GPU's seed-stepping strategy (especially in speculative decoding scenarios).

Changes:

  • Add the XPU build_sampling_params kernel + plugin wrapper + Paddle static op, and wire it into the XPU speculative verify (TARGET_MATCH) path.
  • Introduce increment_value on the XPU ModelRunner side (aligned with GPU: 4 for non-speculative, (num_speculative_tokens+1)*4 for speculative) and adjust when infer_seed is updated.
  • Add the custom_ops/xpu_ops/test/test_build_sampling_params.py unit test, which compares against the Python reference implementation and covers multiple batch shapes plus wrap-around.

PR metadata check (needs completion)

  • The title already contains the [XPU] tag and matches the required format.
  • The "Modifications / Usage or Command / Accuracy Tests" sections of the description are incomplete; if this op affects sampling results or reproducibility, add an accuracy comparison plus the corresponding run command and environment info. If no unit test is added or XPU CI cannot be run, state the reason (this PR does add a unit-test file, but the description should still explain how to run it).

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
fastdeploy/worker/xpu_model_runner.py: compute and pass down increment_value; adjust the infer_seed update logic for the speculative path
fastdeploy/model_executor/layers/sample/sampler.py: switch the XPU verify (TARGET_MATCH) path to build_sampling_params and pass increment_value through
custom_ops/xpu_ops/test/test_build_sampling_params.py: new XPU op unit test, validated against the Python reference implementation
custom_ops/xpu_ops/src/plugin/src/wrapper/mtp_wrapper/build_sampling_params.cpp: new plugin wrapper (CPU + XPU3 dispatch)
custom_ops/xpu_ops/src/plugin/src/kernel/kunlun3cpp/mtp_kernel/build_sampling_params.xpu: new Kunlun3 XPU kernel implementation
custom_ops/xpu_ops/src/plugin/include/xpu/plugin.h: export the build_sampling_params declaration
custom_ops/xpu_ops/src/ops/mtp/build_sampling_params.cc: new Paddle static op registration and call bridge

# 7. Update 'infer_seed' and step_paddle()
self.share_inputs["infer_seed"].add_(self.infer_seed_increment)
self.share_inputs["infer_seed"][:] %= self.MAX_INFER_SEED
if not self.speculative_decoding:
Comment on lines +1124 to +1127
share_inputs["seq_lens_this_time"],
share_inputs["seq_lens_encoder"],
token_num_output_cpu=int(share_inputs["cu_seqlens_q_output"][-1]),
increment_value=increment_value,
Comment on lines +36 to +38
api::Context* ctx = xpu_ctx->x_context();
if (top_p.is_cpu()) {
ctx = new api::Context(api::kCPU);
Comment on lines +50 to +57
// Each cluster computes its own pad_start via a scan over
// seq_lens_this_time / seq_lens_encoder. A shared prefix-sum buffer
// (core 0 of cluster 0 writing per-batch start offsets into a global
// scratch area) is not available here, so core 0 of each cluster
// computes pad_start with its own sequential scan. Because clusters run
// concurrently we cannot share a global accumulator; each cluster
// independently sums the first `bi` entries.
// This is O(bs) per cluster, but bs is typically small (<=512).
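The pad_start scan this comment describes amounts to a prefix sum over padded token counts; a hypothetical Python equivalent, just to make the counting rule concrete:

```python
def pad_start(seq_lens_this_time, seq_lens_encoder, bi):
    # Sum of padded token counts for batches [0, bi): a decoder row
    # (seq_lens_encoder == 0) contributes seq_lens_this_time[k] slots,
    # while an encoder row contributes exactly one slot.
    return sum(slt if sle == 0 else 1
               for slt, sle in zip(seq_lens_this_time[:bi], seq_lens_encoder[:bi]))
```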
@PaddlePaddle-bot

PaddlePaddle-bot commented May 7, 2026

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-05-12 15:01:04

This CI report is generated from the following code (updated every 30 minutes):


1 Task overview

1 Required task failed (Approval) and 2 Required tasks are still running. CI does not currently meet the merge criteria; the Approval failure must be addressed.

Total runs (reruns) | Total tasks | ✅ Passed | ❌ Failed | ⏳ Running | ⏸️ Pending | Skipped
42(0) 42 34 5 2 1 0

2 Task status summary

2.1 Required tasks: 7/10 passed

Required tasks block merging; failures must be handled first.

Status | Task | Duration | Root cause | Suggested fix | Log | Rerun
Approval 7s PR issue: missing the one FastDeploy RD and one PaddlePaddle RD approval required for custom ops. Request approval from an FD RD such as @qingqing01 and a PP RD such as @jeff41404. Job -
Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage - Running - Job -
xpu_4cards_case_test / run_xpu_4cards_cases - Running - Job -
The remaining 7 required tasks passed - - - - -

2.2 Optional tasks — 27/32 passed

Optional tasks do not block merging; failures are informational only.

Status | Task | Duration | Log | Rerun
Run iluvatar Tests / run_iluvatar_cases 43m37s Job -
Trigger Jenkins for PR 22m47s Job -
xpu_unit_test / run_xpu_unit_test 6m7s Job -
Check PR Template 14s Job -
⏸️ CI_HPU - - -
The remaining 27 optional tasks passed - - -

3 Failure details (required only)

Approval — PR process (confidence: high)

Approval

  • Status: ❌ Failed
  • Error type: PR process (custom-op approval)
  • Confidence: high
  • Root cause summary: the PR adds a custom op but is missing approval from one FastDeploy RD and one PaddlePaddle RD
  • Analyzer: generic analysis (fallback)

Root cause details:
check_approval.sh detected that this PR adds a custom op, which triggers additional approval requirements. The script reported "There are 2 approved errors", i.e.: (1) no approval from a FastDeploy RD (qingqing01/Jiang-Jia-Jun/heavengate); (2) no approval from a PaddlePaddle RD (jeff41404/yongqiangma). Exit code 6 indicates that both approvals are missing.

Key log:

0. You must have one FastDeploy RD (qingqing01(dangqingqing), Jiang-Jia-Jun(jiangjiajun), heavengate(dengkaipeng)) approval for adding custom op.
1. You must have one PaddlePaddle RD (jeff41404(gaoxiang), yongqiangma(mayongqiang)) approval for adding custom op.

There are 2 approved errors.
##[error]Process completed with exit code 6.

Suggested fixes:

  1. Request a review and approval from one of the following FastDeploy RDs: @qingqing01 (dangqingqing) / @Jiang-Jia-Jun (jiangjiajun) / @heavengate (dengkaipeng)
  2. Request a review and approval from one of the following PaddlePaddle RDs: @jeff41404 (gaoxiang) / @yongqiangma (mayongqiang)

Fix summary: request approval from a FastDeploy RD such as @qingqing01 and a PaddlePaddle RD such as @jeff41404

Related change: PR title [XPU] add build_sampling_params op; adds a new XPU custom op
Link: view log

@codecov-commenter

codecov-commenter commented May 7, 2026

Codecov Report

❌ Patch coverage is 0% with 5 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@f928343). Learn more about missing BASE report.

Files with missing lines Patch % Lines
fastdeploy/worker/xpu_model_runner.py 0.00% 4 Missing ⚠️
fastdeploy/model_executor/layers/sample/sampler.py 0.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             develop    #7738   +/-   ##
==========================================
  Coverage           ?   63.15%           
==========================================
  Files              ?      461           
  Lines              ?    64129           
  Branches           ?     9824           
==========================================
  Hits               ?    40501           
  Misses             ?    20852           
  Partials           ?     2776           
Flag Coverage Δ
GPU 72.27% <0.00%> (?)
XPU 7.13% <0.00%> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.

Copilot AI review requested due to automatic review settings May 8, 2026 03:04
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 6 comments.

top_k=sampling_metadata.top_k,
top_k_list=sampling_metadata.top_k_list,
topp_seed=topp_seed,
topp_seed=sampling_metadata.topp_seed,
Comment on lines +165 to +167
self.increment_value = (
4 if not self.speculative_decoding else (self.speculative_config.num_speculative_tokens + 1) * 4
)
# 7. Update 'infer_seed' and step_paddle()
self.share_inputs["infer_seed"].add_(self.infer_seed_increment)
self.share_inputs["infer_seed"][:] %= self.MAX_INFER_SEED
if not self.speculative_decoding:
Comment thread custom_ops/xpu_ops/src/ops/mtp/build_sampling_params.cc
Comment on lines +50 to +84
// Each cluster computes its own pad_start via a scan over
// seq_lens_this_time / seq_lens_encoder. A shared prefix-sum buffer
// (core 0 of cluster 0 writing per-batch start offsets into a global
// scratch area) is not available here, so core 0 of each cluster
// computes pad_start with its own sequential scan. Because clusters run
// concurrently we cannot share a global accumulator; each cluster
// independently sums the first `bi` entries.
// This is O(bs) per cluster, but bs is typically small (<=512).

for (int bi = clusterid; bi < bs; bi += nclusters) {
  if (cid == 0) {
    // Read per-batch parameters from global memory.
    float lm_top_p;
    int64_t lm_top_k;
    int64_t lm_seed;
    int lm_slt;  // seq_lens_this_time[bi]
    int lm_sle;  // seq_lens_encoder[bi]

    GM2LM_ASYNC(top_p + bi, &lm_top_p, sizeof(float));
    GM2LM_ASYNC(top_k + bi, &lm_top_k, sizeof(int64_t));
    GM2LM_ASYNC(infer_seed + bi, &lm_seed, sizeof(int64_t));
    GM2LM_ASYNC(seq_lens_this_time + bi, &lm_slt, sizeof(int));
    GM2LM(seq_lens_encoder + bi, &lm_sle, sizeof(int));  // sync barrier

    bool is_decoder = (lm_sle == 0);
    int repeat = is_decoder ? lm_slt : 1;

    // Compute pad_start = sum of token counts for batches [0, bi).
    int pad_start = 0;
    for (int k = 0; k < bi; k++) {
      int slt_k, sle_k;
      GM2LM_ASYNC(seq_lens_this_time + k, &slt_k, sizeof(int));
      GM2LM(seq_lens_encoder + k, &sle_k, sizeof(int));
      pad_start += (sle_k == 0) ? slt_k : 1;
    }
Comment thread benchmarks/error_output.txt Outdated
Comment on lines +1 to +2
RequestFuncOutput(no=2347, request_id='None', generated_text='', reasoning_content='', success=False, latency=0.0, end_timestamp=0.0, output_tokens=0, ttft=0.0, arrival_time=[], itl=[], tpot=0.0, prompt_len=0, prompt_tokens=0, reasoning_tokens=0, res_ttft=0, error='{"error":{"message":"request[chatcmpl-814e8d96-3da8-46b0-b4da-31925c313041] generator error: Input text is too long, input_ids_len (8191) + min_tokens(1) >= max_model_len(8192), Traceback (most recent call last):\\n File \\"/home/paddle_test/works/fd/FastDeploy/fastdeploy/entrypoints/openai/serving_chat.py\\", line 168, in create_chat_completion\\n prompt_token_ids = await self.engine_client.format_and_add_data(current_req_dict)\\n File \\"/home/paddle_test/works/fd/FastDeploy/fastdeploy/entrypoints/engine_client.py\\", line 300, in format_and_add_data\\n await self.add_requests(request)\\n File \\"/home/paddle_test/works/fd/FastDeploy/fastdeploy/entrypoints/engine_client.py\\", line 390, in add_requests\\n raise EngineError(error_msg, error_code=400)\\nfastdeploy.utils.EngineError: Input text is too long, input_ids_len (8191) + min_tokens(1) >= max_model_len(8192)\\n","type":"invalid_request_error","param":null,"code":null}}', metrics={}, tool_calls=[], output_ids=[])
RequestFuncOutput(no=2347, request_id='None', generated_text='', reasoning_content='', success=False, latency=0.0, end_timestamp=0.0, output_tokens=0, ttft=0.0, arrival_time=[], itl=[], tpot=0.0, prompt_len=0, prompt_tokens=0, reasoning_tokens=0, res_ttft=0, error='{"error":{"message":"request[chatcmpl-799cdf97-ab7e-4823-80e4-1833bf5f7d90] generator error: Input text is too long, input_ids_len (8191) + min_tokens(1) >= max_model_len(8192), Traceback (most recent call last):\\n File \\"/home/paddle_test/works/fd/FastDeploy/fastdeploy/entrypoints/openai/serving_chat.py\\", line 168, in create_chat_completion\\n prompt_token_ids = await self.engine_client.format_and_add_data(current_req_dict)\\n File \\"/home/paddle_test/works/fd/FastDeploy/fastdeploy/entrypoints/engine_client.py\\", line 300, in format_and_add_data\\n await self.add_requests(request)\\n File \\"/home/paddle_test/works/fd/FastDeploy/fastdeploy/entrypoints/engine_client.py\\", line 390, in add_requests\\n raise EngineError(error_msg, error_code=400)\\nfastdeploy.utils.EngineError: Input text is too long, input_ids_len (8191) + min_tokens(1) >= max_model_len(8192)\\n","type":"invalid_request_error","param":null,"code":null}}', metrics={}, tool_calls=[], output_ids=[])
Copilot AI review requested due to automatic review settings May 8, 2026 08:14
@Jiajun-Ji Jiajun-Ji force-pushed the xpu-build_sampling_params branch from 651d7cb to cfc5936 Compare May 8, 2026 08:14
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 6 comments.

Comment on lines 1075 to 1081
_, next_tokens = top_k_top_p_sampling(
probs,
top_p=top_p,
top_k=top_k,
top_p=sampling_metadata.top_p,
top_k=sampling_metadata.top_k,
top_k_list=sampling_metadata.top_k_list,
topp_seed=topp_seed,
topp_seed=sampling_metadata.topp_seed,
)
Comment on lines 1118 to 1123
sampling_metadata.seed,
paddle.reshape(share_inputs["seq_lens_this_time"], shape=[-1]),
paddle.reshape(share_inputs["seq_lens_encoder"], shape=[-1]),
share_inputs["seq_lens_this_time"],
share_inputs["seq_lens_encoder"],
token_num_output_cpu=int(share_inputs["cu_seqlens_q_output"][-1]),
increment_value=increment_value,
)
self.share_inputs["infer_seed"][:] %= self.MAX_INFER_SEED
if not self.speculative_decoding:
self.share_inputs["infer_seed"].add_(self.infer_seed_increment)
self.share_inputs["infer_seed"][:] %= self.MAX_INFER_SEED
Comment thread custom_ops/xpu_ops/src/ops/mtp/build_sampling_params.cc
Comment on lines +50 to +64
int64_t pad_idx = 0;
for (int bi = 0; bi < bs; bi++) {
  bool is_decoder = (seq_lens_encoder[bi] == 0);
  int repeat = is_decoder ? seq_lens_this_time[bi] : 1;
  int64_t bi_seed = infer_seed[bi];
  for (int local_pos = 0; local_pos < repeat; local_pos++) {
    int64_t offset = is_decoder ? static_cast<int64_t>(local_pos) * 4 : 0LL;
    top_p_padding[pad_idx] = top_p[bi];
    top_k_padding[pad_idx] = top_k[bi];
    topp_seed[pad_idx] = (bi_seed + offset) % BUILD_SAMPLING_MAX_INFER_SEED;
    pad_idx++;
  }
  infer_seed[bi] =
      (infer_seed[bi] + increment_value) % BUILD_SAMPLING_MAX_INFER_SEED;
}
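Assuming this CPU reference loop is authoritative, its behavior can be mirrored in plain Python to make the padding and seed semantics easy to check (function name and the modulus below are illustrative, not the repository's API):

```python
MAX_SEED = 2 ** 62  # stand-in for BUILD_SAMPLING_MAX_INFER_SEED


def build_sampling_params_ref(top_p, top_k, infer_seed,
                              seq_lens_this_time, seq_lens_encoder,
                              increment_value):
    """Python mirror of the CPU loop: pad per-batch sampling params to
    token granularity and step infer_seed in place."""
    top_p_pad, top_k_pad, topp_seed = [], [], []
    for bi in range(len(top_p)):
        is_decoder = seq_lens_encoder[bi] == 0
        repeat = seq_lens_this_time[bi] if is_decoder else 1
        for pos in range(repeat):
            offset = pos * 4 if is_decoder else 0
            top_p_pad.append(top_p[bi])
            top_k_pad.append(top_k[bi])
            topp_seed.append((infer_seed[bi] + offset) % MAX_SEED)
        # In-place update, matching the kernel's semantics.
        infer_seed[bi] = (infer_seed[bi] + increment_value) % MAX_SEED
    return top_p_pad, top_k_pad, topp_seed
```

For example, a two-batch input with one decoder row (seq_lens_this_time = 2) and one encoder row produces three padded slots, and both seeds advance by the increment.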
Comment on lines +50 to +84
// Each cluster computes its own pad_start via a scan over
// seq_lens_this_time / seq_lens_encoder. A shared prefix-sum buffer
// (core 0 of cluster 0 writing per-batch start offsets into a global
// scratch area) is not available here, so core 0 of each cluster
// computes pad_start with its own sequential scan. Because clusters run
// concurrently we cannot share a global accumulator; each cluster
// independently sums the first `bi` entries.
// This is O(bs) per cluster, but bs is typically small (<=512).

for (int bi = clusterid; bi < bs; bi += nclusters) {
  if (cid == 0) {
    // Read per-batch parameters from global memory.
    float lm_top_p;
    int64_t lm_top_k;
    int64_t lm_seed;
    int lm_slt;  // seq_lens_this_time[bi]
    int lm_sle;  // seq_lens_encoder[bi]

    GM2LM_ASYNC(top_p + bi, &lm_top_p, sizeof(float));
    GM2LM_ASYNC(top_k + bi, &lm_top_k, sizeof(int64_t));
    GM2LM_ASYNC(infer_seed + bi, &lm_seed, sizeof(int64_t));
    GM2LM_ASYNC(seq_lens_this_time + bi, &lm_slt, sizeof(int));
    GM2LM(seq_lens_encoder + bi, &lm_sle, sizeof(int));  // sync barrier

    bool is_decoder = (lm_sle == 0);
    int repeat = is_decoder ? lm_slt : 1;

    // Compute pad_start = sum of token counts for batches [0, bi).
    int pad_start = 0;
    for (int k = 0; k < bi; k++) {
      int slt_k, sle_k;
      GM2LM_ASYNC(seq_lens_this_time + k, &slt_k, sizeof(int));
      GM2LM(seq_lens_encoder + k, &sle_k, sizeof(int));
      pad_start += (sle_k == 0) ? slt_k : 1;
    }
Copilot AI review requested due to automatic review settings May 12, 2026 06:11
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 6 comments.

Comment on lines 1074 to 1081
"""Normal sampling for NAIVE mode on XPU."""
top_p, top_k, topp_seed = padding_sampling_params(
sampling_metadata.top_p,
sampling_metadata.top_k,
sampling_metadata.seed,
paddle.reshape(share_inputs["seq_lens_this_time"], shape=[-1]),
paddle.reshape(share_inputs["seq_lens_encoder"], shape=[-1]),
)
_, next_tokens = top_k_top_p_sampling(
probs,
top_p=top_p,
top_k=top_k,
top_p=sampling_metadata.top_p,
top_k=sampling_metadata.top_k,
top_k_list=sampling_metadata.top_k_list,
topp_seed=topp_seed,
topp_seed=sampling_metadata.topp_seed,
)
Comment on lines +1119 to 1123
share_inputs["seq_lens_this_time"],
share_inputs["seq_lens_encoder"],
token_num_output_cpu=int(share_inputs["cu_seqlens_q_output"][-1]),
increment_value=increment_value,
)
Comment on lines 1451 to +1454
# 7. Update 'infer_seed' and step_paddle()
self.share_inputs["infer_seed"].add_(self.infer_seed_increment)
self.share_inputs["infer_seed"][:] %= self.MAX_INFER_SEED
if not self.speculative_decoding:
self.share_inputs["infer_seed"].add_(self.infer_seed_increment)
self.share_inputs["infer_seed"][:] %= self.MAX_INFER_SEED
Comment on lines +91 to +101
PD_BUILD_STATIC_OP(build_sampling_params)
.Inputs({"top_p",
"top_k",
"infer_seed",
"seq_lens_this_time",
"seq_lens_encoder"})
.Outputs({"top_p_padding", "top_k_padding", "topp_seed"})
.Attrs({"token_num_output_cpu: int64_t", "increment_value: int64_t"})
.SetKernelFn(PD_KERNEL(BuildSamplingParams))
.SetInferShapeFn(PD_INFER_SHAPE(BuildSamplingParamsInferShape))
.SetInferDtypeFn(PD_INFER_DTYPE(BuildSamplingParamsInferDtype));
Comment on lines +50 to +57
// Each cluster computes its own pad_start via a scan over
// seq_lens_this_time / seq_lens_encoder. A shared prefix-sum buffer
// (core 0 of cluster 0 writing per-batch start offsets into a global
// scratch area) is not available here, so core 0 of each cluster
// computes pad_start with its own sequential scan. Because clusters run
// concurrently we cannot share a global accumulator; each cluster
// independently sums the first `bi` entries.
// This is O(bs) per cluster, but bs is typically small (<=512).
Comment on lines +15 to +25
"""
Unit tests for build_sampling_params XPU op.

Verifies that the XPU kernel produces the same output as the Python reference
implementation (padding_sampling_params) for all cases:
- pure decoder batches (seq_lens_encoder == 0)
- pure encoder batches (seq_lens_encoder > 0)
- mixed encoder/decoder batches
- single-item batch (bs=1)
- seed wrap-around near MAX_INFER_SEED
"""

@PaddlePaddle-bot PaddlePaddle-bot left a comment


🤖 Paddle-CI-Agent | pr_review | 2026-05-12 14:22:53

📋 Review summary

PR overview: replaces the Python implementation of padding_sampling_params on XPU with the XPU kernel build_sampling_params, moves the infer_seed update into the kernel, and aligns the increment_value step with the GPU implementation.
Scope: custom_ops/xpu_ops/ (new kernel + wrapper + op registration), fastdeploy/model_executor/layers/sample/sampler.py, fastdeploy/worker/xpu_model_runner.py
Impact tags: [XPU] [OP]

📝 PR compliance check

The title contains the valid tag [XPU] and is correctly formatted, but the ## Modifications and ## Usage or Command sections are empty (only the template comments remain) and no Checklist items are checked.

Suggested title (copy-paste ready):

  • [XPU][OP] Add build_sampling_params XPU kernel to replace Python padding_sampling_params

Suggested PR description (copy-paste ready; must reproduce the full structure of the checklist §D2 template):

## Motivation
Replace the pure-Python `padding_sampling_params` implementation on XPU with the XPU kernel `build_sampling_params` to reduce host-device synchronization overhead. Also move the `infer_seed` update logic into the kernel, and align the increment_value step from XPU's previous fixed value of 4 with the GPU implementation (non-speculative decoding: 4; speculative decoding: (num_speculative_tokens + 1) * 4).

## Modifications
- Add XPU kernel: `custom_ops/xpu_ops/src/plugin/src/kernel/kunlun3cpp/mtp_kernel/build_sampling_params.xpu`, implementing top_p/top_k/seed padding and the in-place `infer_seed` update
- Add C++ wrapper: `custom_ops/xpu_ops/src/plugin/src/wrapper/mtp_wrapper/build_sampling_params.cpp`, supporting both the CPU and XPU3 execution paths
- Add Paddle custom op registration: `custom_ops/xpu_ops/src/ops/mtp/build_sampling_params.cc`
- Declare the `build_sampling_params` interface in `plugin.h`
- `sampler.py`: replace `padding_sampling_params` with `build_sampling_params` in `_verify_and_sample_xpu`; `_normal_sample_xpu` uses `sampling_metadata` fields directly
- `xpu_model_runner.py`: compute `increment_value`; non-speculative decoding keeps updating `infer_seed` in the Python layer, while for speculative decoding the kernel updates it internally
- Add unit test: `custom_ops/xpu_ops/test/test_build_sampling_params.py`, covering pure-decoder, pure-encoder, mixed, single-batch, and seed wrap-around scenarios

## Usage or Command
N/A (internal implementation swap; no change to external interfaces)

## Accuracy Tests
Verified that INT64 modulo arithmetic inside the XPU kernel works correctly (see the screenshot in this PR); accuracy results match the original Python implementation.

## Checklist

- [x] Add at least a tag in the PR title.
  - Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
  - You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [x] Add unit tests. Please write the reason in this PR if no unit tests.
- [x] Provide accuracy results.
- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.

Issues

Level | File | Summary
🔴 Bug | fastdeploy/model_executor/layers/sample/sampler.py:1080 | the sampling_metadata.topp_seed attribute does not exist; NAIVE-mode XPU sampling will crash at runtime

Overall assessment

The new XPU kernel implementation is clearly structured, the CPU/XPU3 dual paths are complete, and the unit tests are thorough. However, _normal_sample_xpu uses the nonexistent attribute sampling_metadata.topp_seed (SamplingMetadata only has a seed field), which will crash the non-speculative XPU path at runtime; this must be fixed before merging.

top_k=sampling_metadata.top_k,
top_k_list=sampling_metadata.top_k_list,
topp_seed=topp_seed,
topp_seed=sampling_metadata.topp_seed,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Bug: sampling_metadata.topp_seed does not exist and will raise an AttributeError

The SamplingMetadata dataclass (meta_data.py) only has a seed field; there is no topp_seed. When XPU NAIVE mode (non-speculative decoding) calls _normal_sample_xpu, this line will raise AttributeError: 'SamplingMetadata' object has no attribute 'topp_seed' at runtime.

Suggested fixes (pick one):

  1. Use the existing seed field directly (but check whether seed is already in padded form):
topp_seed=sampling_metadata.seed,
  2. If a padded seed is needed, add topp_seed: Optional[paddle.Tensor] = None to SamplingMetadata and fill that field at the SamplingMetadata(...) construction site in xpu_model_runner.py via padding_sampling_params (or the new build_sampling_params kernel).
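The second option could look roughly like the sketch below; the topp_seed field is the reviewer's proposal rather than an existing attribute, and the tensor type is stubbed out so the shape of the change is visible without paddle installed:

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class SamplingMetadata:
    # Existing field (a paddle.Tensor in the real codebase).
    seed: Any
    # Proposed optional padded-seed field, filled at construction time in
    # xpu_model_runner.py via the padding kernel. Hypothetical, per review.
    topp_seed: Optional[Any] = None
```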


@PaddlePaddle-bot PaddlePaddle-bot left a comment


🤖 Paddle-CI-Agent | pr_review | 2026-05-13 14:35:06

📋 Review summary

PR overview: replaces the Python implementation of padding_sampling_params on XPU with the XPU kernel build_sampling_params, moves the infer_seed update logic into the kernel (speculative path), and aligns the step value with the GPU implementation.
Scope: custom_ops/xpu_ops/ (new kernel/wrapper/op registration), sampler.py, xpu_model_runner.py
Impact tags: [XPU] [OP]

📝 PR compliance check

The title contains the [XPU] tag and is compliant, but the ## Modifications and ## Usage or Command sections of the description are empty (template comments only) and no ## Checklist items are checked; please complete them using the template below.

Suggested title (copy-paste ready):

  • [XPU][OP] Add build_sampling_params XPU kernel to replace padding_sampling_params

Suggested PR description (copy-paste ready; must reproduce the full structure of the checklist §D2 template):

## Motivation
Replace the Python `padding_sampling_params` implementation on XPU with the XPU kernel `build_sampling_params`, performing the sampling-parameter padding inside the kernel. Also move the `infer_seed` update logic into `build_sampling_params` (speculative decoding path), and align the `infer_seed` `increment_value` step with the GPU implementation (a step of 4 per token).

## Modifications
- `custom_ops/xpu_ops/src/plugin/src/kernel/kunlun3cpp/mtp_kernel/build_sampling_params.xpu`: new XPU3 kernel supporting mixed decoder/encoder batches; cluster 0 updates `infer_seed` in place
- `custom_ops/xpu_ops/src/plugin/src/wrapper/mtp_wrapper/build_sampling_params.cpp`: new CPU reference implementation and XPU3 wrapper
- `custom_ops/xpu_ops/src/ops/mtp/build_sampling_params.cc`: register the custom op with `PD_BUILD_STATIC_OP`, outputting `top_p_padding`, `top_k_padding`, `topp_seed`
- `custom_ops/xpu_ops/src/plugin/include/xpu/plugin.h`: declare the `build_sampling_params` function interface
- `fastdeploy/model_executor/layers/sample/sampler.py`: `_verify_and_sample_xpu` replaces `padding_sampling_params` with `build_sampling_params`; `_normal_sample_xpu` uses `sampling_metadata.topp_seed` directly; `forward_xpu` / `_verify_and_sample_xpu` gain an `increment_value` parameter
- `fastdeploy/worker/xpu_model_runner.py`: add `self.increment_value` (non-speculative = 4, speculative = (N+1)*4); the speculative path drops the external `infer_seed.add_()` update, which the kernel now maintains internally
- Add unit test: `custom_ops/xpu_ops/test/test_build_sampling_params.py` (6 cases: pure decoder, pure encoder, mixed, single batch, seed wrap-around, etc.)

## Usage or Command
N/A

## Accuracy Tests
Verified that INT64 modulo arithmetic inside the XPU kernel works correctly (see the screenshot attached to the PR).

## Checklist

- [x] Add at least a tag in the PR title.
  - Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
  - You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [x] Add unit tests. Please write the reason in this PR if no unit tests.
- [x] Provide accuracy results.
- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.

Issues

Level | File | Summary
🟡 Suggestion | custom_ops/xpu_ops/src/ops/mtp/build_sampling_params.cc:28 | infer_seed is modified in place via a non-const reference but is declared only under .Inputs in PD_BUILD_STATIC_OP, with no SetInplaceMap declaration
🟡 Suggestion | custom_ops/xpu_ops/src/plugin/src/wrapper/mtp_wrapper/build_sampling_params.cpp:145 | the wrapper only supports XPU3; other generations (e.g. XPU2) will hit WRAPPER_UNIMPLEMENTED at runtime
❓ Question | custom_ops/xpu_ops/ | new .cc / .cpp / .xpu source files were added, but the diff shows no update to setup_ops.py or CMakeLists.txt; please confirm these files are included in the build

Overall assessment

The approach is sound: pushing sampling-parameter padding and the infer_seed update down into the XPU kernel reduces host-side overhead, and the unit tests cover multiple scenarios. Please double-check the Paddle custom-op framework's inplace-declaration semantics and the completeness of the build registration.

std::vector<paddle::Tensor> BuildSamplingParams(
const paddle::Tensor& top_p,
const paddle::Tensor& top_k,
paddle::Tensor& infer_seed,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Suggestion: infer_seed is passed by non-const reference and updated in place by the kernel, but PD_BUILD_STATIC_OP declares it only as a read-only input via .Inputs({"infer_seed"}), with no SetInplaceMap.

Per the Paddle custom-op conventions, in-place modification of an input tensor must be declared explicitly via SetInplaceMap; otherwise, in AOT / static-graph scenarios the framework may create a copy of the input, making the infer_seed update invisible to the caller (the unit test's check against seed.numpy() would fail as well).

Suggested: also add infer_seed to Outputs and declare the inplace mapping:

PD_BUILD_STATIC_OP(build_sampling_params)
    .Inputs({"top_p", "top_k", "infer_seed", "seq_lens_this_time", "seq_lens_encoder"})
    .Outputs({"top_p_padding", "top_k_padding", "topp_seed", "infer_seed_updated"})
    .SetInplaceMap({{"infer_seed", "infer_seed_updated"}})
    ...

Alternatively, return the updated infer_seed directly as a fourth output.

token_num,
increment_value);
} else if (ctx->dev().type() == api::kXPU3) {
return xpu3_wrapper(ctx,


🟡 Suggestion: the wrapper currently only provides an implementation for api::kXPU3; other XPU generations (e.g. XPU2) will hit the trailing WRAPPER_UNIMPLEMENTED(ctx) and fail at runtime.

If this op targets XPU3 only, add a hardware-generation check at the op-registration layer or the call site so a clear error is raised up front; if XPU2 support is needed later, extend the corresponding branch here.

@PaddlePaddle-bot

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-05-13 14:49:37

This CI report is generated from the following code (updated every 30 minutes):


1 Task overview

1 Required task failed (Approval) and 3 more Required tasks are running (including the main test task run_tests_with_coverage). Merging is not recommended yet; wait for the Required tasks to finish before re-evaluating.

Total runs (reruns) | Total tasks | ✅ Passed | ❌ Failed | ⏳ Running | ⏸️ Pending | Skipped
36(0) 36 28 3 5 0 0

2 Task status summary

2.1 Required tasks: 6/10 passed

Required tasks block merging; failures must be handled first.

Status | Task | Duration | Root cause | Suggested fix | Log | Rerun
Approval 7s Infrastructure: PR has not received the required approvals. Contact a team member with approval rights. Job -
Extracted partial CE model tasks to run in CI. / run_ce_cases - Running - Job -
Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage - Running - Job -
xpu_4cards_case_test / run_xpu_4cards_cases - Running - Job -
The remaining 6 required tasks passed - - - - -

2.2 Optional tasks — 22/26 passed

Optional tasks do not block merging; failures are informational only.

Status | Task | Duration | Log | Rerun
xpu_unit_test / run_xpu_unit_test 5m59s Job -
Check PR Template 16s Job -
CI_HPU - Job -
Run iluvatar Tests / run_iluvatar_cases - Job -
The remaining 22 optional tasks passed - - -

3 Failure details (required only)

Approval — Infrastructure (confidence: high)

Approval

  • Status: ❌ Failed
  • Error type: infrastructure
  • Confidence: high
  • Root cause summary: the PR has not received the required approvals; the approval-workflow check failed
  • Analyzer: generic analysis (fallback)

Root cause details:
The Approval workflow checks whether the PR has received the necessary approvals from team members. It failed after running only 7 seconds, with no detailed error annotations. This is the typical failure pattern of the approval check and is unrelated to the PR's code changes. Deep log retrieval failed due to a network TLS timeout, but based on the workflow's name and behavior this can be attributed with high confidence to the PR not yet having the required approvals.

Key log:

⚠️ No annotations available (deep log retrieval failed: TLS handshake timeout)

Suggested fixes:

  1. Ask a team member with approval rights to review and approve this PR
  2. Confirm the PR description meets the requirements and approval preconditions

Fix summary: contact a team member with approval rights to approve this PR

Link: view log
